Write-Up

This project has two parts:

  1. Accessing the Riot Games API to gather the data.
  2. Using that data to try to predict whether a game will be won or lost.

Gathering

This was the first project in which I used APIs to gather data from other systems! When building my script I relied on Python packages that wrap the Riot Games API. I was unfamiliar with calling APIs directly, and these wrappers made it very easy to gather my data!

Using the Riot API and the Cassiopeia Python library, I was able to collect around 17,000 match entries from the KR, EUW, and NA servers. All of these matches were from the Challenger division.

The data collection process is one of the most important steps for any later analysis, so I used random sampling and kept only one participant out of the ten in every match to ensure my entries were independent.

The process went as follows: I gathered every region's Challenger players, fetched each player's 20 most recent games, and randomly selected one participant (not necessarily the player whose history I was reading) from each match to add to my data. I also ensured that no match_id was ever used twice across any players' match histories.
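The dedup-and-sample logic above can be sketched as follows. The toy match histories here stand in for the real API calls, which came from Cassiopeia's challenger-league and match-history endpoints:

```python
import random

random.seed(1337)

# Toy stand-ins for match histories: {player: [(match_id, participants), ...]}.
# In the real script these came from the Riot API via Cassiopeia.
histories = {
    "playerA": [(1, ["playerA", "p2", "p3"]), (2, ["playerA", "playerB", "p4"])],
    "playerB": [(2, ["playerA", "playerB", "p4"]), (3, ["playerB", "p6", "p7"])],
}

seen_match_ids = set()
samples = []
for player, history in histories.items():
    for match_id, participants in history:
        if match_id in seen_match_ids:   # never use a match_id twice
            continue
        seen_match_ids.add(match_id)
        # keep one random participant per match to preserve independence
        samples.append((match_id, random.choice(participants)))

print(len(samples))  # 3 unique matches, one sampled participant each
```

Match 2 appears in both players' histories but contributes only one row, which is exactly the guarantee described above.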

Analysis

Project Statement

Use logistic regression and decision trees to predict LoL game outcomes by role, and determine which variables add the most value to these models (i.e., the factor that contributes the most gain).

North America

Below are the features we will consider in our model development!

Our first steps include previewing our data, changing any data types, and splitting our dataset into five datasets, each corresponding to one of the positions in League of Legends:

  • Mid Lane (playmaker)
  • Top Lane (lone wolf)
  • Jungle (captain)
  • ADC (damage dealer)
  • Support (backup)
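The per-role split itself is a one-liner with a pandas groupby; a minimal sketch on a toy frame (the real dataset's role column holds the five positions above):

```python
import pandas as pd

# Toy frame with the same idea as the real dataset's 'role' column
df = pd.DataFrame({
    "role":  ["MID", "TOP", "JUNGLE", "MID", "SUPPORT"],
    "kills": [5, 2, 7, 3, 1],
})

# One DataFrame per position
role_dfs = {role: sub.reset_index(drop=True) for role, sub in df.groupby("role")}

print(sorted(role_dfs))      # ['JUNGLE', 'MID', 'SUPPORT', 'TOP']
print(len(role_dfs["MID"]))  # 2
```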
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 5760 entries, 0 to 5759
## Data columns (total 21 columns):
##  #   Column                Non-Null Count  Dtype  
## ---  ------                --------------  -----  
##  0   d_spell               5760 non-null   int64  
##  1   f_spell               5760 non-null   int64  
##  2   champion              5760 non-null   object 
##  3   side                  5760 non-null   object 
##  4   role                  5760 non-null   object 
##  5   assists               5760 non-null   int64  
##  6   damage_objectives     5760 non-null   int64  
##  7   damage_building       5760 non-null   int64  
##  8   damage_turrets        5760 non-null   int64  
##  9   deaths                5760 non-null   int64  
##  10  gold_earned           5760 non-null   int64  
##  11  kda                   5760 non-null   float64
##  12  kills                 5760 non-null   int64  
##  13  level                 5760 non-null   int64  
##  14  time_cc               5760 non-null   int64  
##  15  damage_total          5760 non-null   int64  
##  16  damage_taken          5760 non-null   int64  
##  17  total_minions_killed  5760 non-null   int64  
##  18  turret_kills          5760 non-null   int64  
##  19  vision_score          5760 non-null   int64  
##  20  result                5760 non-null   bool   
## dtypes: bool(1), float64(1), int64(16), object(3)
## memory usage: 950.6+ KB

We have no null records in any of our features (how lucky!), so all we have to do is convert our spell variables to the object data type.
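The conversion itself is a single astype call; a minimal sketch (the spell columns hold summoner-spell IDs, which are labels rather than quantities):

```python
import pandas as pd

df = pd.DataFrame({"d_spell": [4, 14], "f_spell": [12, 4]})

# Spell IDs are categorical labels, not numbers to do math on
df[["d_spell", "f_spell"]] = df[["d_spell", "f_spell"]].astype(object)

print(df["d_spell"].dtype)  # object
```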

We will revisit this later, but these graphs act as a snapshot in time of what the best players in North America were playing!

Support

The first role we will take a deeper look at is support!

## (1149, 21)

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
assists 0 1 12.90 7.02 0 8.00 12.0 17.00 41 ▆▇▅▁▁
damage_objectives 0 1 814.34 991.86 0 89.00 502.0 1156.00 7600 ▇▂▁▁▁
damage_building 0 1 1985.46 2163.87 0 473.00 1330.0 2748.00 16552 ▇▂▁▁▁
damage_turrets 0 1 814.34 991.86 0 89.00 502.0 1156.00 7600 ▇▂▁▁▁
deaths 0 1 5.44 2.96 0 3.00 5.0 7.00 16 ▆▇▅▂▁
gold_earned 0 1 7775.28 2147.87 2980 6282.00 7655.0 9073.00 16792 ▃▇▅▁▁
kda 0 1 4.21 4.30 0 1.56 2.8 5.17 35 ▇▁▁▁▁
kills 0 1 2.40 2.37 0 1.00 2.0 3.00 17 ▇▂▁▁▁
level 0 1 11.91 2.23 6 10.00 12.0 13.00 18 ▁▃▇▃▁
time_cc 0 1 27.89 14.59 2 18.00 25.0 36.00 105 ▇▇▂▁▁
damage_total 0 1 24880.94 15387.32 3926 15616.00 20955.0 28809.00 149403 ▇▂▁▁▁
damage_taken 0 1 14690.56 6735.89 2008 9833.00 13599.0 18399.00 60004 ▇▇▁▁▁
total_minions_killed 0 1 27.14 14.97 0 17.00 27.0 35.00 126 ▇▇▁▁▁
turret_kills 0 1 0.27 0.61 0 0.00 0.0 0.00 5 ▇▁▁▁▁
vision_score 0 1 57.11 25.12 6 39.00 55.0 73.00 161 ▃▇▅▁▁

The side distribution is very similar, which is good. Even though the study uses random sampling, we should still keep track of side: an unequal distribution could hide the fact that side plays a role in outcomes.

As we can see here, our response variable is a bit imbalanced in favor of losses, so the F1 score could be beneficial, as it handles imbalanced datasets well. However, the difference is small (about 40 data points), so ROC could be our best method of measuring model performance.
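As a quick illustration of why plain accuracy can mislead on imbalanced labels while F1 does not, consider a degenerate classifier that always predicts the majority class (synthetic labels, not the match data):

```python
from sklearn.metrics import accuracy_score, f1_score

# 60 losses (0) vs 40 wins (1): mildly imbalanced, like the support data
y_true = [0] * 60 + [1] * 40
y_pred = [0] * 100          # "always predict loss" model

print(accuracy_score(y_true, y_pred))  # 0.6 -- looks passable
print(f1_score(y_true, y_pred))        # 0.0 -- exposes the useless model
```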

Visualize Correlation

Notable factors with high correlation with our response: assists (0.495), deaths (-0.411), kda (0.542), and damage_objectives (0.421).

It is also worth noting that damage_turrets and damage_objectives have a correlation of 1, indicating they hold identical values (their summary statistics above match exactly). Therefore, I will be dropping damage_turrets; either column could have been chosen.
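A sketch of how such a perfectly correlated pair shows up and gets dropped (toy numbers, not the real columns' values):

```python
import pandas as pd

df = pd.DataFrame({
    "damage_objectives": [100, 500, 1200],
    "damage_turrets":    [100, 500, 1200],  # identical values
    "result":            [0, 1, 1],
})

corr = df["damage_objectives"].corr(df["damage_turrets"])
print(round(corr, 6))  # 1.0

# A perfectly correlated duplicate adds no information, so drop one
df = df.drop(columns=["damage_turrets"])
print(list(df.columns))  # ['damage_objectives', 'result']
```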

Some relationships, such as deaths and kda, also show high correlations, but that is expected, since deaths appears in the expression used to calculate kda.

However, some other features are correlated as well, indicating possible multicollinearity, which can cause issues for our linear and logistic models further down the line.

Data Preprocessing

Since we have no missing values, there is no need to impute anything; we will only be label-encoding our categorical variables. However, we will be removing the champion column (too many levels), the d and f spell columns (too many levels), and the role column (we have already split the problem by role).

# Label Encoding

# 1: red side, 0: blue side
support_data_final['side'] = support_data_final['side'].map({'Side.red': 1, 'Side.blue':0})

# 1: win, 0: loss
support_data_final['result'] = support_data_final['result'].map({True: 1, False:0})

Create Testing and Training Sets

We will be using an 80-20 train-test split.

from sklearn.model_selection import train_test_split, RepeatedKFold

X_support = support_data_final.drop(columns=['result']).values
Y_support = support_data_final['result'].values
X_train_suppport, X_test_suppport, Y_train_support, Y_test_support = train_test_split(X_support,Y_support,
                                                                                      test_size=0.2, random_state=1337)

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1337)

Creating Our Models

Here we will create three models:

  • Logistic Regression
  • Decision Tree
  • Random Forest

and we will be using the F1 score and Repeated K-Fold Cross Validation to evaluate our models.

Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

support_log_model = LogisticRegression(random_state=1337, max_iter=10000)
support_log_model.fit(X_train_suppport,Y_train_support)
support_log_f1scores = cross_val_score(support_log_model, X_train_suppport, Y_train_support,
                                       scoring="f1",cv=cv, n_jobs=-1)
support_log_ROCscores = cross_val_score(support_log_model, X_train_suppport, Y_train_support,
                                        scoring="roc_auc", cv=cv, n_jobs=-1)
print('F1: %.3f (%.3f)' % (np.mean(support_log_f1scores), np.std(support_log_f1scores)))
## F1: 0.822 (0.041)
print('ROC: %.3f (%.3f)' % (np.mean(support_log_ROCscores), np.std(support_log_ROCscores)))
## ROC: 0.908 (0.030)
Decision Tree
from sklearn.tree import DecisionTreeClassifier

support_dt_model = DecisionTreeClassifier(criterion = 'entropy', random_state = 1337)
support_dt_model.fit(X_train_suppport,Y_train_support)
support_dt_f1scores = cross_val_score(support_dt_model, X_train_suppport, Y_train_support,
                                      scoring="f1",cv=cv, n_jobs=-1)
support_dt_ROCscores = cross_val_score(support_dt_model, X_train_suppport, Y_train_support,
                                       scoring="roc_auc", cv=cv, n_jobs=-1)
print('F1: %.3f (%.3f)' % (np.mean(support_dt_f1scores), np.std(support_dt_f1scores)))
## F1: 0.777 (0.044)
print('ROC: %.3f (%.3f)' % (np.mean(support_dt_ROCscores), np.std(support_dt_ROCscores)))
## ROC: 0.786 (0.045)
Random Forest
from sklearn.ensemble import RandomForestClassifier

support_rf_model = RandomForestClassifier(criterion = 'entropy', random_state = 1337)
support_rf_model.fit(X_train_suppport,Y_train_support)
support_rf_f1scores = cross_val_score(support_rf_model, X_train_suppport, Y_train_support,
                                      scoring="f1",cv=cv, n_jobs=-1)

support_rf_ROCscores = cross_val_score(support_rf_model, X_train_suppport, Y_train_support,
                                       scoring="roc_auc", cv=cv, n_jobs=-1)
print('F1: %.3f (%.3f)' % (np.mean(support_rf_f1scores), np.std(support_rf_f1scores)))
## F1: 0.836 (0.036)
print('ROC: %.3f (%.3f)' % (np.mean(support_rf_ROCscores), np.std(support_rf_ROCscores)))
## ROC: 0.931 (0.022)
Predictions

Using cross-validation on our training set, we have concluded from both the ROC and F1 scores that our Random Forest model performs the best. Therefore, we will use this model on our test set!

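Evaluating the chosen model on the held-out 20% comes down to a single predict call. Below is a self-contained sketch on synthetic data; in the write-up the fitted support_rf_model and the X_test_suppport / Y_test_support split would be used instead:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the support data: a learnable signal in two features
rng = np.random.default_rng(1337)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1337)

model = RandomForestClassifier(criterion='entropy', random_state=1337)
model.fit(X_train, y_train)

# Hard labels for F1, class probabilities for ROC AUC
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print('F1: %.3f' % f1_score(y_test, y_pred))
print('ROC: %.3f' % roc_auc_score(y_test, y_prob))
```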
Feature Importance
import pandas as pd

support_feature_importance=pd.DataFrame({
    'Random Forest':support_rf_model.feature_importances_,
    'Decision Tree':support_dt_model.feature_importances_,
    # absolute coefficients serve as a rough importance proxy
    'Logistic Regression':[abs(i) for i in support_log_model.coef_[0]]
},index=support_data_final.drop(columns=['result']).columns)
support_feature_importance.sort_values(by='Random Forest',ascending=True,inplace=True)

support_feature_importance.plot(kind='barh',figsize=(12,10), width=.85, colormap='Paired', fontsize=15)

It looks like all three models placed high emphasis on kda, with the decision tree placing the most. The logistic regression model also treats deaths as an important feature, along with level, assists, and kills. (Note that the logistic coefficients here are only roughly comparable, since the features were not standardized before fitting.)
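One hedge against comparing impurity-based importances with raw coefficients is permutation importance, which is model-agnostic; a sketch on synthetic data (not the support split):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data where only the first feature carries signal
rng = np.random.default_rng(1337)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(random_state=1337).fit(X, y)

# Shuffle each feature and measure the drop in model score
result = permutation_importance(model, X, y, n_repeats=10, random_state=1337)

print(result.importances_mean.argmax())  # 0 -- the informative feature
```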

Wrap-Up

I will end this write-up here; a more detailed article covering each region and position is linked below, as including everything would make this far too long!

If you would like to dig further into the code, please view the repository below!

Thank you :)